Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms
نویسنده
چکیده
Big data is often mined using clustering algorithms. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a popular spatial clustering algorithm. However, it is computationally expensive and thus for clustering big data, parallel processing is required. The two prevalent paradigms for parallel processing are High-Performance Computing (HPC) based on Message Passing Interface (MPI) or Open Multi-Processing (OpenMP) and the newer big data frameworks such as Apache Spark or Hadoop. This report surveys for these two different paradigms publicly available implementations that aim at parallelizing DBSCAN and compares their performance. As a result, it is found that the big data implementations are not yet mature and in particular for skewed data, the implementation’s decomposition of the input data into parallel tasks has a huge influence on the performance in terms of running time.
منابع مشابه
A density based clustering approach to distinguish between web robot and human requests to a web server
Today world's dependence on the Internet and the emerging of Web 2.0 applications is significantly increasing the requirement of web robots crawling the sites to support services and technologies. Regardless of the advantages of robots, they may occupy the bandwidth and reduce the performance of web servers. Despite a variety of researches, there is no accurate method for classifying huge data ...
متن کاملExperiments in Parallel Clustering with DBSCAN
We present a new result concerning the parallelisation of DBSCAN, a Data Mining algorithm for density-based spatial clustering. The overall structure of DBSCAN has been mapped to a skeletonstructured program that performs parallel exploration of each cluster. The approach is useful to improve performance on high-dimensional data, and is general w.r.t. the spatial index structure used. We report...
متن کاملApplication of Soft Computing Methods for the Estimation of Roadheader Performance from Schmidt Hammer Rebound Values
Estimation of roadheader performance is one of the main topics in determining the economics of underground excavation projects. The poor performance estimation of roadheader scan leads to costly contractual claims. In this paper, the application of soft computing methods for data analysis called adaptive neuro-fuzzy inference system- subtractive clustering method (ANFIS-SCM) and artificial neu...
متن کاملAssessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...
متن کاملبررسی مشکلات الگوریتم خوشه بندی DBSCAN و مروری بر بهبودهای ارائهشده برای آن
Clustering is an important knowledge discovery technique in the database. Density-based clustering algorithms are one of the main methods for clustering in data mining. These algorithms have some special features including being independent from the shape of the clusters, highly understandable and ease of use. DBSCAN is a base algorithm for density-based clustering algorithms. DBSCAN is able to...
متن کامل